Audio-visual speech recognition using MCE-based HMMs and model-dependent stream weights
Abstract
This paper presents a framework for designing a hidden Markov model (HMM)-based audio-visual automatic speech recognition (ASR) system using minimum classification error (MCE) training. The audio and visual HMM parameters are optimized with the generalized probabilistic descent (GPD) method, and their likelihoods are combined using model-dependent stream weights, which are also estimated with the GPD method. Experimental results on speaker-independent isolated word recognition show that audio-visual ASR performance is significantly improved by the GPD optimization of the audio and visual HMMs and by the introduction of model-dependent stream weights, resulting in a 47% to 81% error reduction over a conventional system consisting of HMMs trained with the maximum likelihood criterion and globally tied stream weights estimated with the GPD method.
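As a rough illustration of the combination step described in the abstract, the sketch below scores each word model as a weighted sum of its audio and visual log-likelihoods, with one weight pair per model, and evaluates a smoothed MCE loss of the kind GPD would descend. The function names, the example numbers, and the exact form of the misclassification measure (a soft maximum over rival models passed through a sigmoid, as is common in MCE work) are assumptions made for illustration; the abstract does not specify them.

```python
import math


def combined_score(log_lik_audio, log_lik_visual, w_audio, w_visual):
    """Weighted sum of the per-stream log-likelihoods for one word model."""
    return w_audio * log_lik_audio + w_visual * log_lik_visual


def recognize(audio_ll, visual_ll, weights):
    """Return the best-scoring model and all combined scores.

    audio_ll, visual_ll: dict mapping model name -> stream log-likelihood
    weights: dict mapping model name -> (w_audio, w_visual), one pair per
             model (model-dependent rather than globally tied)
    """
    scores = {
        m: combined_score(audio_ll[m], visual_ll[m], *weights[m])
        for m in audio_ll
    }
    return max(scores, key=scores.get), scores


def mce_loss(scores, correct, eta=1.0, gamma=1.0):
    """Smoothed MCE loss (assumed form): sigmoid of a misclassification measure."""
    rivals = [s for m, s in scores.items() if m != correct]
    # Soft maximum over rival scores, computed stably by factoring out the peak.
    peak = max(rivals)
    mean_exp = sum(math.exp(eta * (s - peak)) for s in rivals) / len(rivals)
    soft_max = peak + math.log(mean_exp) / eta
    # Positive d means a rival outscored the correct model.
    d = soft_max - scores[correct]
    return 1.0 / (1.0 + math.exp(-gamma * d))


# Hypothetical two-word example: noisy audio slightly prefers "no",
# while the visual stream clearly prefers "yes".
audio_ll = {"yes": -120.0, "no": -118.0}
visual_ll = {"yes": -60.0, "no": -75.0}
weights = {"yes": (0.6, 0.4), "no": (0.7, 0.3)}

best, scores = recognize(audio_ll, visual_ll, weights)
print(best, mce_loss(scores, correct="yes"))
```

In a GPD-style procedure, the HMM parameters and the per-model weights would be nudged along the negative gradient of this loss over the training tokens. Constraints on the weights, such as non-negativity or summing to one, are left out of this sketch.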
Similar resources
Improvement of Audio-visual Speech Recognition in Cars
For multi-stream HMMs which are used to effectively combine acoustic and visual information, it is important to optimize stream weights automatically and properly in order to improve the performance. This paper proposes a new stream-weight optimization method based on a likelihood-ratio maximization criterion, in which the difference of log likelihood values between the first and other hypothes...
Weighting and normalisation of synchronous HMMs for audio-visual speech recognition
In this paper, we examine the effect of varying the stream weights in synchronous multi-stream hidden Markov models (HMMs) for audio-visual speech recognition. Rather than considering the stream weights to be the same for training and testing, we examine the effect of different stream weights for each task on the final speech-recognition performance. Evaluating our system under varying levels o...
Asynchrony modeling for audio-visual speech recognition
We investigate the use of multi-stream HMMs in the automatic recognition of audio-visual speech. Multi-stream HMMs allow the modeling of asynchrony between the audio and visual state sequences at a variety of levels (phone, syllable, word, etc.) and are equivalent to product, or composite, HMMs. In this paper, we consider such models synchronized at the phone boundary level, allowing various de...
Stream weight estimation for multistream audio-visual speech recognition in a multispeaker environment
The paper considers the problem of audio-visual speech recognition in a simultaneous (target/masker) speaker environment. The paper follows a conventional multistream approach and examines the specific problem of estimating reliable time-varying audio and visual stream weights. The task is challenging because, in the two-speaker condition, signal-to-noise ratio (SNR) – and hence audio stream wei...
Viseme-dependent weight optimization for CHMM-based audio-visual speech recognition
The aim of the present study is to investigate some key challenges of the audio-visual speech recognition technology, such as asynchrony modeling of multimodal speech, estimation of auditory and visual speech significance, as well as stream weight optimization. Our research shows that the use of viseme-dependent significance weights improves the performance of state asynchronous CHMM-based spee...
Publication date: 2000